Perfect Hashing and Probability
Author: A. Nilli
Abstract
A simple proof is given of the best known upper bound on the cardinality of a set of vectors of length $t$ over an alphabet of size $b$, with the property that for every subset of $k$ vectors there is a coordinate in which they all differ. This question is motivated by the study of perfect hash functions.

A set of vectors of length $t$ over an alphabet of size $b$ is called $k$-separated if for every $k$ distinct vectors there is a coordinate in which they all differ. Let $N(t, b, k)$ denote the largest cardinality of such a $k$-separated set of vectors. Thus, $N = N(t, b, k)$ is the maximum size of a domain for which there exists a perfect family of $t$ $(b, k)$ hash functions, that is, a family of $t$ functions mapping a domain of size $N$ into a set of size $b$ so that every subset of $k$ elements of the domain is mapped in a one-to-one fashion by at least one of the functions. The problem of estimating the function $N(t, b, k)$, which is motivated by the numerous applications of perfect hashing in theoretical computer science, has received a considerable amount of attention. The interesting case is that in which $t$ is much bigger than $b$ (and, of course, $b \ge k$). Fredman and Komlós [2] (see also [4]) proved that

$$\frac{1}{k-1}\,\log\frac{1}{1-g(b,k)} \;\sim\; \frac{1}{t}\,\log N(t, b, k) \tag{1}$$

and that

$$\frac{1}{t}\,\log N(t, b, k) \;\sim\; g(b, k-1)\,\log(b-k+2), \tag{2}$$

where $g(b,k) = b(b-1)\cdots(b-k+1)/b^k$ and where here, and in what follows, the notation $A \sim B$ means that $A \le (1+o(1))B$, the $o(1)$ term tending to zero as $t$ (or $N$) tends to infinity.

The first inequality is proved by a simple probabilistic argument which demonstrates the alteration method discussed, for example, in Chapter 3 of [1]: one chooses an appropriate number of vectors randomly, shows that the expected number of non-separated $k$-tuples is small, and omits a vector from each such bad $k$-tuple. The proof of the second inequality is much more difficult and relies on certain techniques from information theory. The information-theoretic approach is clarified in [4], where (2) is rederived by applying properties of the notion of graph entropy, introduced by Körner. This approach is extended in [5], where the authors introduce the notion of hypergraph entropy and apply it, together with various information-theoretic arguments, to obtain a strengthening of (2), namely

$$\frac{1}{t}\,\log N(t, b, k) \;\sim\; \min_{0 \le j \le k-2}\; g(b, j+1)\,\log\frac{b-j}{k-j-1}. \tag{3}$$

Our objective in this note is to present a short and simple probabilistic proof of (3) which requires no information-theoretic tools.

(Mailing address: A. Nilli, c/o N. Alon, Department of Mathematics, Tel Aviv University, Tel Aviv, Israel.)

Lemma. Let $F$ be a family of $m$ vectors of length $t$ over the alphabet $C \cup \{*\}$, where $C = \{1, 2, \ldots, c\}$, let $x_v$ denote the number of non-$*$ coordinates of $v \in F$, and let $x = \sum_v x_v / m$ be the average value of $x_v$. If for every $d$ distinct vectors in $F$ there is a coordinate in which they are all different from $*$ and are all distinct, then

$$m \;\le\; (d-1)\left(\frac{c}{d-1}\right)^{x}.$$

Proof. For every coordinate $i$ choose, randomly and independently, a subset $D_i$ of cardinality $d-1$ of $C$. Call a vector $v \in F$ consistent if $v_i \in D_i \cup \{*\}$ for every $i$. The assumption clearly implies that for any choice of the sets $D_i$ there are no more than $d-1$ consistent vectors: any $d$ consistent vectors would take $d$ pairwise distinct non-$*$ values in some coordinate $i$, all lying in the set $D_i$ of size $d-1$. The conclusion thus follows from the fact that the expected number of consistent vectors is

$$\sum_{v \in F} \left(\frac{d-1}{c}\right)^{x_v} \;\ge\; m\left(\frac{d-1}{c}\right)^{x},$$

since each non-$*$ coordinate of $v$ lies in the corresponding $D_i$ with probability $(d-1)/c$, and where the last inequality follows from the convexity of the function $g(z) = \left(\frac{d-1}{c}\right)^{z}$. □
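To make the random experiment in this proof concrete, here is a minimal Python sketch. It is an illustration, not part of the original note; the toy family below is chosen so that the separation hypothesis holds by construction (coordinate 0 is non-$*$ and pairwise distinct). It samples the random sets $D_i$, verifies that no choice of the $D_i$ ever yields more than $d-1$ consistent vectors, and compares the empirical average number of consistent vectors with the bound $m((d-1)/c)^x$:

```python
import random

random.seed(1)

c, d, t, m = 5, 3, 6, 5          # alphabet size, separation order, vector length, family size
C = list(range(1, c + 1))

# Toy family over C ∪ {'*'}: coordinate 0 takes pairwise distinct non-* values,
# so every d of these vectors are all non-* and all distinct there.
F = [(p,) + tuple(random.choice(C + ['*']) for _ in range(t - 1))
     for p in range(1, m + 1)]

def consistent(v, Ds):
    # v is consistent if every non-* coordinate falls in the chosen set D_i
    return all(x == '*' or x in D for x, D in zip(v, Ds))

trials, total, worst = 20000, 0, 0
for _ in range(trials):
    Ds = [set(random.sample(C, d - 1)) for _ in range(t)]  # random (d-1)-subsets of C
    cnt = sum(consistent(v, Ds) for v in F)
    total += cnt
    worst = max(worst, cnt)

xbar = sum(sum(x != '*' for x in v) for v in F) / m        # average number of non-* coordinates
exact = sum(((d - 1) / c) ** sum(x != '*' for x in v) for v in F)

print("max consistent:", worst, "(never exceeds d-1 =", d - 1, ")")
print("empirical E[#consistent]:", total / trials, " exact:", exact)
print("convexity lower bound m*((d-1)/c)^x:", m * ((d - 1) / c) ** xbar)
```

Since the count of consistent vectors never exceeds $d-1$ while its expectation is at least $m((d-1)/c)^x$, rearranging these two facts yields exactly the bound of the lemma.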
Proof of (3). Let $G$ be a $k$-separated family of $N = N(t, b, k)$ vectors of length $t$ over an alphabet $B$ of size $b$, and suppose $0 \le j \le k-2$. Let us say that a set of $j+1$ vectors is separated at coordinate $i$ if their values in this coordinate are all distinct. It is not difficult to see that the fraction of $(j+1)$-subsets of vectors which are separated at a given coordinate is at most $(1+o(1))\,g(b, j+1)$. (This follows from Muirhead's inequality; see [3], page 44.) Let $J \subset G$ be a randomly chosen set of $j$ vectors, and let $v$ be a random vector in $G - J$. Let $x_v$ denote the number of coordinates of $v$ in which all the vectors in $J \cup \{v\}$ have distinct values. By the above remark, the expectation of $x_v$ is at most $(1+o(1))\,g(b, j+1)\,t$, and hence there is a fixed subset $J$ for which the average value $x$ of $x_v$ over $v \in G - J$ is at most $(1+o(1))\,g(b, j+1)\,t$.

We now define a family $F$ of $N - j$ vectors over $\{1, 2, \ldots, b-j, *\}$ as follows. For each $v \in G - J$, define a member $v' \in F$ by letting the value of $v'$ in coordinate $i$ be $*$ unless $J \cup \{v\}$ is separated at this coordinate; in this latter case, we define $v'_i = p$ if the value of $v_i$ is the $p$-th largest element of the set $B - \{u_i : u \in J\}$. Note that for every set $S$ of $d = k - j$ vectors in $F$ there must be a coordinate in which the values of all these vectors are distinct and differ from $*$, since otherwise the $k$-set $S \cup J$ is not separated. Therefore, by the Lemma with $c = b - j$, $d = k - j$ and $m = |F| = N - j$,

$$N - j \;\le\; (k-j-1)\left(\frac{b-j}{k-j-1}\right)^{x} \;\le\; (k-j-1)\left(\frac{b-j}{k-j-1}\right)^{(1+o(1))\,g(b,j+1)\,t},$$

and taking logarithms implies (3). □ (A short code sketch of the reduction used here follows the acknowledgement below.)

Acknowledgement. I would like to thank N. Alon and A. Orlitsky for helpful discussions and N. Alon for his help in writing this manuscript.
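As promised, the reduction from $G$ to $F$ in the proof of (3) is mechanical enough to state as code. The following Python sketch (my own naming and a toy demonstration, not from the note) builds $F$ from a $k$-separated family $G$, a fixed $j$-subset $J$, and the alphabet $B$, writing $*$ at every coordinate where $J \cup \{v\}$ is not separated and the rank $p$ otherwise:

```python
def reduce_family(G, J, B):
    """Map each v in G - J to a vector v' over {1, ..., |B|-|J|, '*'}:
    coordinate i of v' is '*' unless all of J ∪ {v} take distinct values
    at i, in which case v'_i = p where v_i is the p-th largest element
    of B - {u_i : u in J}."""
    F = []
    for v in G:
        if v in J:
            continue
        w = []
        for i, vi in enumerate(v):
            used = {u[i] for u in J}
            # J ∪ {v} is separated at i: J's values are distinct and v_i differs
            if len(used) == len(J) and vi not in used:
                remaining = sorted((x for x in B if x not in used), reverse=True)
                w.append(remaining.index(vi) + 1)   # p-th largest -> p
            else:
                w.append('*')
        F.append(tuple(w))
    return F

# Tiny demo: the diagonal family is 3-separated via coordinate 0.
B = [1, 2, 3, 4]                       # b = 4, k = 3
G = [(1, 1), (2, 2), (3, 3), (4, 4)]
J = [G[0]]                             # j = 1, so c = b - j = 3, d = k - j = 2
print(reduce_family(G, J, B))          # [(3, 3), (2, 2), (1, 1)] over {1, 2, 3, '*'}
```

Any $d = k - j$ vectors of the output share a coordinate in which they are non-$*$ and pairwise distinct, which is exactly the hypothesis of the lemma: a failure would pull back to a non-separated $k$-set $S \cup J$ in $G$.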
Similar articles
Quasi-Perfect Hashing
The idea of quasi-perfect hashing is introduced and applied to solve the static dictionary problem. Given a universe U and a set S of n distinct keys belonging to U, we propose a quasi-perfect hash function which allows one to find a key from S, stored in a hash table of size m, m ≥ n, in O(1) time. While looking up a key, at most two probes into the hash table are made. Our main motivation is ...
De Dictionariis Dynamicis Pauco Spatio Utentibus (lat. On Dynamic Dictionaries Using Little Space)
We develop dynamic dictionaries on the word RAM that use asymptotically optimal space, up to constant factors, subject to insertions and deletions, and subject to supporting perfect-hashing queries and/or membership queries, each operation in constant time with high probability. When supporting only membership queries, we attain the optimal space bound of Θ(n lg(u/n)) bits, where n and u are th...
Using Tries to Eliminate Pattern Collisions in Perfect Hashing
Many current perfect hashing algorithms suffer from the problem of pattern collisions. In this paper, a perfect hashing technique that uses array-based tries and a simple sparse matrix packing algorithm is introduced. This technique eliminates all pattern collisions, and because of this it can be used to form ordered minimal perfect hash functions on extremely large word lists. This algorithm i...
Lecture 10 — March 20, 2012
In the last lecture, we finished up talking about memory hierarchies and linked cache-oblivious data structures with geometric data structures. In this lecture we talk about different approaches to hashing. First, we talk about different hash functions and their properties, from basic universality to k-wise independence to a simple but effective hash function called simple tabulation. Then, we ...
Perfect hashing using sparse matrix packing
This article presents a simple algorithm for packing sparse 2-D arrays into minimal 1-D arrays in O(r²) time. Retrieving an element from the packed 1-D array is O(1). This packing algorithm is then applied to create minimal perfect hashing functions for large word lists. Many existing perfect hashing algorithms process large word lists by segmenting them into several smaller lists. The perfect ...
A Survey on Efficient Hashing Techniques in Software Configuration Management
This paper presents a survey on efficient hashing techniques in software configuration management scenarios. It introduces the most important hashing techniques, such as open hashing, separate chaining, and minimal perfect hashing. Furthermore, we evaluate those hashing techniques on large data sets, comparing the hash functions in terms of time to build the data structur...